Fri Oct 30, 2020
Here is the link: https://github.com/avaruusseppo/IODS-project
I have some experience with R already, but I am not yet too familiar with R Markdown. Thanks to Kimmo for the hint about R bookdown. Also thanks for the peppy warm-up speech!
# Let us do something in this "R chunk".
plot(sin,from=-2,to=2,xlab="days after course start",ylab = "enthusiasm")
A small gRaph of my musings.
Fri 06 Nov 2020
Loading a prepared csv table, containing data from a learning study (International survey of Approaches to Learning, Kimmo Vehkalahti, fall 2016). Metainformation: https://www.mv.helsinki.fi/home/kvehkala/JYTmooc/JYTOPKYS3-meta.txt
lrn14<-read.csv("data/learning2014.csv")
print(dim(lrn14))
## [1] 166 7
str(lrn14)
## 'data.frame': 166 obs. of 7 variables:
## $ gender : Factor w/ 2 levels "F","M": 1 2 1 2 2 1 2 1 2 1 ...
## $ age : int 53 55 49 53 49 38 50 37 37 42 ...
## $ attitude: int 37 31 25 35 37 38 35 29 38 21 ...
## $ deep : num 3.58 2.92 3.5 3.5 3.67 ...
## $ stra : num 3.38 2.75 3.62 3.12 3.62 ...
## $ surf : num 2.58 3.17 2.25 2.25 2.83 ...
## $ points : int 25 12 24 10 22 21 21 31 24 26 ...
A quick overview of the contents in this data:
This is a learning study on statistics students, examining the different factors that might relate to a high or low exam score (points) for students of different ages and genders.
The students were interviewed about their attitude towards statistics and asked various motivational questions about their modes of learning.
The mode-related questions are grouped into three categories (deep, surface and strategic). Each category score is the average of the corresponding numeric questionnaire items.
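A small sketch of how such a category score could be formed from individual questionnaire items (the item names and values here are invented for illustration, not taken from the actual survey):

```r
# Toy example: three hypothetical deep-learning items for two students
ans <- data.frame(q1 = c(4, 2), q2 = c(3, 5), q3 = c(5, 4))
deep <- rowMeans(ans)  # category score = mean of its items
deep
```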
By examining the structure, we can see that the data has 166 observations in 7 variable columns.
The summary() function for a data frame in R shows the ranges where the values of variables reside.
# This shows summaries of the variables in the data
summary(lrn14)
## gender age attitude deep stra
## F:110 Min. :17.00 Min. :14.00 Min. :1.583 Min. :1.250
## M: 56 1st Qu.:21.00 1st Qu.:26.00 1st Qu.:3.333 1st Qu.:2.625
## Median :22.00 Median :32.00 Median :3.667 Median :3.188
## Mean :25.51 Mean :31.43 Mean :3.680 Mean :3.121
## 3rd Qu.:27.00 3rd Qu.:37.00 3rd Qu.:4.083 3rd Qu.:3.625
## Max. :55.00 Max. :50.00 Max. :4.917 Max. :5.000
## surf points
## Min. :1.583 Min. : 7.00
## 1st Qu.:2.417 1st Qu.:19.00
## Median :2.833 Median :23.00
## Mean :2.787 Mean :22.72
## 3rd Qu.:3.167 3rd Qu.:27.75
## Max. :4.333 Max. :33.00
The learning mode parameters (deep, stra, surf) are averaged from a Likert-style (1-5) questionnaire. Attitude is a subjective measure of interest towards learning statistics. Age is the student’s age in years.
The exam points (used as an output variable in this study) vary from 7 to 33. By looking at the 1st and 3rd quartiles in the summary output, the score distribution could be expected to follow a Gaussian bell curve.
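As a rough sanity check of that expectation, the quartile distances around the median (values copied by hand from the summary() output above) are nearly symmetric, which is consistent with a bell shape:

```r
# Quartile values for points, from the summary() output above
q1 <- 19; med <- 23; q3 <- 27.75
med - q1  # distance from median to 1st quartile: 4
q3 - med  # distance from median to 3rd quartile: 4.75
```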
Simple plot with linear fitting
We’ll quickly try a scatter plot matching attitude to exam points. This is roughly what the Excel chart wizard does (but nothing more).
# We need the 'ggplot2' graphics package and ggpairs (from GGally)
library(ggplot2)
library(GGally)
## Registered S3 method overwritten by 'GGally':
## method from
## +.gg ggplot2
# A scatter plot of students' attitude and exam points
qplot(attitude, points, col = gender, data = lrn14) + geom_smooth(method = "lm")
## `geom_smooth()` using formula 'y ~ x'
Examining the relationship between attitude towards learning and the scores graphically, we find a positive correlation. We can also see that gender is not much of a factor in the course scores. A good attitude would predict a high score, so a positive correlation is what we would expect.
As a second example, we might be especially interested in how the different estimated learning modes affect the score. For instance, strategic learning looks promising. A positive correlation shows up as a fitted line sloping towards the upper right-hand corner.
# A scatter plot of students' strategic approach vs. points, fitted by gender
qplot(stra, points, col = gender, data = lrn14) + geom_smooth(method = "lm")
## `geom_smooth()` using formula 'y ~ x'
Now, instead of examining a single variable’s effect and distribution, we might want a graphical overview of them all, with GGally:
ggpairs(lrn14, lower = list(combo = wrap("facethist", bins = 20)))
We continue by building a linear model with multiple variables. This model predicts the output variable (exam score) from the input variables.
After examining the matrix plot above, I choose three variables (age, attitude and strategic approach) as explanatory input variables. A regression model is fitted to these variables, with exam points as the dependent (output) variable.
Below is a summary of the fitted model and comment, with an interpretation of the R output.
# create a regression model with multiple explanatory variables
my_model2 <- lm(points ~ attitude + stra + age, data = lrn14)
The parameter “points ~ attitude + stra + age” is a so-called formula. It states the dependent variable on the left-hand side of the tilde (“~”) mark. The explanatory variables are listed on the right-hand side. In this model, the explanatory variables are assumed to be independent of each other.
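A formula is an ordinary R object and can be stored and inspected on its own before fitting; a small illustration:

```r
# A model formula can be examined before passing it to lm()
f <- points ~ attitude + stra + age
class(f)     # "formula"
all.vars(f)  # "points" "attitude" "stra" "age"
```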
# print out a summary of the model
summary(my_model2)
##
## Call:
## lm(formula = points ~ attitude + stra + age, data = lrn14)
##
## Residuals:
## Min 1Q Median 3Q Max
## -18.1149 -3.2003 0.3303 3.4129 10.7599
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 10.89543 2.64834 4.114 6.17e-05 ***
## attitude 0.34808 0.05622 6.191 4.72e-09 ***
## stra 1.00371 0.53434 1.878 0.0621 .
## age -0.08822 0.05302 -1.664 0.0981 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.26 on 162 degrees of freedom
## Multiple R-squared: 0.2182, Adjusted R-squared: 0.2037
## F-statistic: 15.07 on 3 and 162 DF, p-value: 1.07e-08
The summary() function for an lm() model shows the estimated coefficients with their standard errors, t-values and p-values; the coefficients correspond to the input variables of our model.
First R outputs a summary of the residual distribution. The residuals are the errors between the fitted and observed output values (points).
The first listed coefficient (the intercept) is a baseline: the predicted output when all explanatory variables are zero. Each following coefficient is expected to affect the output with some strength; it is the “slope” of the fitted linear relationship between that input variable and the output.
For each coefficient, R prints out a t-statistic (so-called Student's test) and a p-value, which is the probability of observing a t-statistic this extreme if the true coefficient were zero. R also prints significance codes: e.g. "***" for strong significance (p < 0.001) and "." for weak significance (0.05 < p < 0.1). (Most social studies use the 95 percent confidence level as the baseline, but we can accept this model for now.)
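The reported p-values can be reproduced from the t-statistics with pt(); for example the attitude row (t = 6.191 on 162 residual degrees of freedom, values from the summary above):

```r
# Two-sided p-value for the attitude coefficient
t_val <- 6.191
df    <- 162
p <- 2 * pt(-abs(t_val), df)
p  # about 4.7e-09, matching the summary output
```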
The F-statistic tests the model as a whole against an intercept-only model (a comparison of explained and residual variance). The smaller its p-value is, the more probable the alternative hypothesis becomes. In other words, a small p-value shows that the model captures a meaningful statistical relationship.
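The F-statistic can also be recomputed by hand from the multiple R squared: with k explanatory variables and n observations, F = (R²/k) / ((1 − R²)/(n − k − 1)). Plugging in the values reported in the summary above:

```r
# Recompute the F-statistic from the reported R-squared
r2 <- 0.2182; k <- 3; n <- 166
f_stat <- (r2 / k) / ((1 - r2) / (n - k - 1))
f_stat  # about 15.07, matching the summary output
```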
The R squared is the fraction of variance explained by the model. In other words, if this value is large, the model gives a good fit and there is little or no need to search for additional explanatory variables.
The multiple R squared of the model is 0.2182. I would have expected a somewhat higher R squared; this suggests there are still more descriptive variables “out there” to be added to the model. Still, the R squared is clearly above zero, so the model does better than fitting a horizontal line at the mean of the output.
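The definition of R squared (1 minus the residual sum of squares over the total sum of squares) can be verified on a toy fit; the data below is simulated, not the course data:

```r
# R-squared = 1 - RSS/TSS, checked against summary()$r.squared
set.seed(1)
x <- rnorm(50)
y <- 2 * x + rnorm(50)
fit <- lm(y ~ x)
rss <- sum(resid(fit)^2)               # residual sum of squares
tss <- sum((y - mean(y))^2)            # total sum of squares
r2  <- 1 - rss / tss
all.equal(r2, summary(fit)$r.squared)  # TRUE
```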
In conclusion, the model suggests that the attitude variable has a very significant relationship with the exam score, having a p-value < 0.001. The evidence for this alternative hypothesis is very strong.
Also age and the strategic learning mode have a somewhat significant relation to the output, since their p-values are below 0.1. It seems that higher age would predict a slightly lower exam score, and that strategic learning would yield some extra exam points. Such effects might be worth further study.
R has this convenience function for plotting lm() models:
plot(my_model2,c(1,2,5))
These graphs help us evaluate how appropriate our statistical model is for estimating the effect of the student’s age, attitude towards statistics and adoption of strategic learning methods.
The Residuals vs fitted values graph gives a general visualization of how well this linear model fits the data. If the residuals are uniformly scattered across the range of fitted values, we can be fairly comfortable with these coefficients.
The Normal Q-Q plot is a quick visualization of how well the residual distribution follows the Gaussian distribution. If most of the points lie along the dotted diagonal, the normality assumption of the errors looks reasonable and there is less reason to suspect a missing systematic “explaining” variable.
The Residuals vs leverage graph is a useful tool for finding outliers in a model. An outlier is a data point whose output value is exceptionally far from the estimate produced from the input variables. In other words, any points beyond the dotted red line would have high influence on the fitted model. Here none of the points seem to have especially high influence on our model coefficients, and the residuals are fairly normally distributed.
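A numeric complement to the Q-Q plot is the Shapiro-Wilk normality test; a sketch on simulated data below (the same call would be applied to resid(my_model2) for the actual model):

```r
# Shapiro-Wilk test of residual normality on a toy linear model
set.seed(2)
d   <- data.frame(x = rnorm(80))
d$y <- 1 + 2 * d$x + rnorm(80)
fit <- lm(y ~ x, data = d)
shapiro.test(resid(fit))  # a large p-value gives no evidence against normality
```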
Fri 13 Nov 2020
From data/alc.csv file:
alc<-read.csv("data/alc.csv")
print(dim(alc))
## [1] 382 35
A quick overview of the contents in this data: This is a student performance vs alcohol usage survey. The data is joined from two course survey data sets, with scores averaged, covering 382 distinct students who took both the Portuguese and math courses. An average numeric column alc_use (Likert scale 1-5) for alcohol consumption and a logical column high_use (TRUE if alc_use is greater than 2) were added.
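A sketch of how those two columns were presumably derived from the weekday (Dalc) and weekend (Walc) consumption columns; the rows below are toy values for illustration, not actual survey data:

```r
# Presumed derivation of alc_use and high_use (toy rows)
df <- data.frame(Dalc = c(1, 2, 5), Walc = c(1, 3, 5))
df$alc_use  <- (df$Dalc + df$Walc) / 2  # average consumption
df$high_use <- df$alc_use > 2           # TRUE for high consumers
df
```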
colnames(alc)
## [1] "school" "sex" "age" "address" "famsize"
## [6] "Pstatus" "Medu" "Fedu" "Mjob" "Fjob"
## [11] "reason" "nursery" "internet" "guardian" "traveltime"
## [16] "studytime" "failures" "schoolsup" "famsup" "paid"
## [21] "activities" "higher" "romantic" "famrel" "freetime"
## [26] "goout" "Dalc" "Walc" "health" "absences"
## [31] "G1" "G2" "G3" "alc_use" "high_use"
Data contains school performance scores (grades after periods 1-3), absences, alcohol use and some binary background data such as school name, gender, age and home location type, plus numeric Likert-scale (1 to 5) estimates of family relations, free-time activity, going out and health. More information at: https://archive.ics.uci.edu/ml/datasets/Student+Performance NB: the alcohol use columns (average and logical high consumption) were added in the wrangling step.
My hypothesis: alcohol might be consumed at a high rate if the initial G1 grades are low and family relations are poor. Male gender and higher age might also induce high alcohol usage.
A graphical summary:
library(GGally)
ggpairs(alc,
columns = c('alc_use','age','sex','famrel','G1'),
lower = list(combo = wrap("facethist", bins = 30)))
Some plots on these factors, first a box plot of the initial grades by high alcohol consumption:
g1 <- ggplot(alc, aes(x = high_use, y = G1, col = sex))
g1 + geom_boxplot() + ylab("G1")+ggtitle("Student initial grades by high alcohol consumption and gender")
It seems the initial grades of high alcohol users are a bit lower, in both genders.
Then let’s see the age distribution within high and low users.
# initialize a plot of age, to be faceted by 'high_use'
g2 <- ggplot(alc, aes(x=(age), fill=sex))
g2 + facet_wrap("high_use") + geom_bar() + ggtitle("age distributions by high alcohol usage and gender")
It seems the age distribution might differ noticeably for the high alcohol consumers.
g1 <- ggplot(alc, aes(x = high_use, y = famrel, col = sex))
g1 + geom_boxplot() + ylab("famrel")+ggtitle("Family relation estimate by high alcohol consumption and gender")
These look promising for the hypothesis.
Now I’m about to fit a logistic regression model with the binomial family (high alcohol consumers vs. the control group).
# find the model with glm()
m <- glm(high_use ~ age + sex + G1 + famrel, data = alc, family = "binomial")
summary(m)
##
## Call:
## glm(formula = high_use ~ age + sex + G1 + famrel, family = "binomial",
## data = alc)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.4499 -0.8193 -0.6569 1.1384 2.1599
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.9968 1.8764 -1.064 0.2872
## age 0.1917 0.1007 1.903 0.0571 .
## sexM 0.9436 0.2378 3.967 7.26e-05 ***
## G1 -0.1141 0.0447 -2.553 0.0107 *
## famrel -0.3207 0.1262 -2.542 0.0110 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 465.68 on 381 degrees of freedom
## Residual deviance: 432.29 on 377 degrees of freedom
## AIC: 442.29
##
## Number of Fisher Scoring iterations: 4
coef(m)
## (Intercept) age sexM G1 famrel
## -1.9968299 0.1916833 0.9435755 -0.1141214 -0.3207391
All the coefficients seem to be somewhat relevant to predicting high usage. This is more easily understood via odds ratios, which are the exponentials of the coefficients.
Odds ratios and their confidence intervals:
# compute odds ratios (OR); the pipe operator requires dplyr (or magrittr)
library(dplyr)
OR <- coef(m) %>% exp
# compute confidence intervals (CI)
CI <- confint(m) %>% exp
## Waiting for profiling to be done...
cbind(OR, CI)
## OR 2.5 % 97.5 %
## (Intercept) 0.1357650 0.003360881 5.3701291
## age 1.2112868 0.995467029 1.4791035
## sexM 2.5691511 1.620172718 4.1227535
## G1 0.8921496 0.815955982 0.9726699
## famrel 0.7256125 0.565422154 0.9289606
Analysis of the fitted model: It seems that good family relations reduce the odds of high alcohol consumption. Higher age and male gender also appear to be inducing factors for high consumption. Higher grades in the first period likewise hint at a lower probability of high consumption.
All in all, we’ll keep these coefficients as the factors of our prediction.
First, I create a prediction and probability column, based on the fitting.
# predict() the probability of high_use
probabilities <- predict(m, type = "response")
Adding the predicted probabilities to ‘alc’ and using the probabilities to make a prediction of high_use:
alc <- mutate(alc, probability = probabilities)
alc <- mutate(alc, prediction = probability > 0.5)
To understand and examine the prediction, let’s see the last ten original classes, predicted probabilities, and “high use” class predictions:
select(alc, age, sex, G1, famrel,high_use, probability, prediction) %>% tail(10)
## age sex G1 famrel high_use probability prediction
## 373 19 M 6 4 FALSE 0.6504559 TRUE
## 374 18 M 6 5 TRUE 0.5271286 TRUE
## 375 18 F 12 5 FALSE 0.1795082 FALSE
## 376 18 F 6 4 FALSE 0.3742059 FALSE
## 377 19 F 8 5 FALSE 0.2949394 FALSE
## 378 18 F 11 4 FALSE 0.2525945 FALSE
## 379 18 F 6 2 FALSE 0.5317729 TRUE
## 380 18 F 8 1 FALSE 0.5547198 TRUE
## 381 17 M 12 2 TRUE 0.5484541 TRUE
## 382 18 M 10 4 TRUE 0.4932191 FALSE
Now let’s tabulate the target variable versus the predictions:
table(high_use = alc$high_use, prediction = alc$prediction)
## prediction
## high_use FALSE TRUE
## FALSE 249 19
## TRUE 93 21
We can be pretty OK with this. There is always a bit of uncertainty, but the model seems OK. A graphic evaluation of the probability and prediction results is below:
g <- ggplot(alc, aes(x = probability, y = high_use, col = prediction))
g + geom_point() + ggtitle("cross-tabulation of prediction and high usage, with the probability parameter")
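The confusion matrix above can also be summarized numerically; the counts below are copied by hand from the printed table rather than recomputed from the data:

```r
# Accuracy and error rate from the tabulated counts above
cm <- matrix(c(249, 93, 19, 21), nrow = 2,
             dimnames = list(high_use   = c("FALSE", "TRUE"),
                             prediction = c("FALSE", "TRUE")))
accuracy <- sum(diag(cm)) / sum(cm)  # correct predictions / all predictions
error    <- 1 - accuracy             # proportion of wrong predictions
c(accuracy = accuracy, error = error)
```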
First let’s define a loss function (which is the average error in prediction).
# define a loss function (average prediction error)
loss_func <- function(class, prob) {
n_wrong <- abs(class - prob) > 0.5
mean(n_wrong)
}
loss_func(class = alc$high_use, prob = alc$probability)
## [1] 0.2931937
The computed proportion of wrong predictions in the (training) data is around 0.29.
Next, we’ll do a ten-fold cross-validation with resampling
library(boot)
cv <- cv.glm(data = alc, cost = loss_func, glmfit = m, K = 10)
cv$delta[1]
## [1] 0.2984293
The proportion of wrong predictions in the cross-validation is not especially small (about 3 out of 10), but we can be fairly happy with this.
Seppo Nyrkkö Fri Nov 20, 2020
library(MASS)
##
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
##
## select
The Boston dataset is about housing values in the suburbs of Boston. It comes with the R MASS data set collection.
( Harrison, D. and Rubinfeld, D.L. (1978) Hedonic prices and the demand for clean air. J. Environ. Economics and Management 5, 81–102. – Belsley D.A., Kuh, E. and Welsch, R.E. (1980) Regression Diagnostics. Identifying Influential Data and Sources of Collinearity. New York: Wiley. )
# structure and the dimensions of the Boston data
data("Boston")
str(Boston)
## 'data.frame': 506 obs. of 14 variables:
## $ crim : num 0.00632 0.02731 0.02729 0.03237 0.06905 ...
## $ zn : num 18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
## $ indus : num 2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
## $ chas : int 0 0 0 0 0 0 0 0 0 0 ...
## $ nox : num 0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
## $ rm : num 6.58 6.42 7.18 7 7.15 ...
## $ age : num 65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
## $ dis : num 4.09 4.97 4.97 6.06 6.06 ...
## $ rad : int 1 2 2 3 3 3 5 5 5 5 ...
## $ tax : num 296 242 242 222 222 222 311 311 311 311 ...
## $ ptratio: num 15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
## $ black : num 397 397 393 395 397 ...
## $ lstat : num 4.98 9.14 4.03 2.94 5.33 ...
## $ medv : num 24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
summary(Boston)
## crim zn indus chas
## Min. : 0.00632 Min. : 0.00 Min. : 0.46 Min. :0.00000
## 1st Qu.: 0.08204 1st Qu.: 0.00 1st Qu.: 5.19 1st Qu.:0.00000
## Median : 0.25651 Median : 0.00 Median : 9.69 Median :0.00000
## Mean : 3.61352 Mean : 11.36 Mean :11.14 Mean :0.06917
## 3rd Qu.: 3.67708 3rd Qu.: 12.50 3rd Qu.:18.10 3rd Qu.:0.00000
## Max. :88.97620 Max. :100.00 Max. :27.74 Max. :1.00000
## nox rm age dis
## Min. :0.3850 Min. :3.561 Min. : 2.90 Min. : 1.130
## 1st Qu.:0.4490 1st Qu.:5.886 1st Qu.: 45.02 1st Qu.: 2.100
## Median :0.5380 Median :6.208 Median : 77.50 Median : 3.207
## Mean :0.5547 Mean :6.285 Mean : 68.57 Mean : 3.795
## 3rd Qu.:0.6240 3rd Qu.:6.623 3rd Qu.: 94.08 3rd Qu.: 5.188
## Max. :0.8710 Max. :8.780 Max. :100.00 Max. :12.127
## rad tax ptratio black
## Min. : 1.000 Min. :187.0 Min. :12.60 Min. : 0.32
## 1st Qu.: 4.000 1st Qu.:279.0 1st Qu.:17.40 1st Qu.:375.38
## Median : 5.000 Median :330.0 Median :19.05 Median :391.44
## Mean : 9.549 Mean :408.2 Mean :18.46 Mean :356.67
## 3rd Qu.:24.000 3rd Qu.:666.0 3rd Qu.:20.20 3rd Qu.:396.23
## Max. :24.000 Max. :711.0 Max. :22.00 Max. :396.90
## lstat medv
## Min. : 1.73 Min. : 5.00
## 1st Qu.: 6.95 1st Qu.:17.02
## Median :11.36 Median :21.20
## Mean :12.65 Mean :22.53
## 3rd Qu.:16.95 3rd Qu.:25.00
## Max. :37.97 Max. :50.00
506 obs with 14 variables. All numerical.
# let's take only a sample, this takes some cpu cycles and produces a grid of tiny plots
ggpairs(Boston[sample(seq(1,nrow(Boston)),size=50),])
Some correlation can be seen in the scatter plots. Few distributions are normal, most are skewed or “double-bumped”, meaning there are clusters inside the variables.
# let us try the cor() matrix and corrplot() (from the 'corrplot' package)
library(corrplot)
cor_matrix <- cor(Boston)
corrplot(cor_matrix, method="circle", type="upper", cl.pos="b", tl.pos="d", tl.cex=0.6)
The correlation plot shows a few interesting positive correlations (radial highways and property tax, nitrogen oxides and industrial acres) and negative correlations (between industrial acres and distance to employment centres, and between median home values and the lower-status population share). Seems quite OK.
# print out summaries of the scaled data.
boston_sc <- as.data.frame(scale(Boston))
summary(boston_sc)
## crim zn indus chas
## Min. :-0.419367 Min. :-0.48724 Min. :-1.5563 Min. :-0.2723
## 1st Qu.:-0.410563 1st Qu.:-0.48724 1st Qu.:-0.8668 1st Qu.:-0.2723
## Median :-0.390280 Median :-0.48724 Median :-0.2109 Median :-0.2723
## Mean : 0.000000 Mean : 0.00000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.007389 3rd Qu.: 0.04872 3rd Qu.: 1.0150 3rd Qu.:-0.2723
## Max. : 9.924110 Max. : 3.80047 Max. : 2.4202 Max. : 3.6648
## nox rm age dis
## Min. :-1.4644 Min. :-3.8764 Min. :-2.3331 Min. :-1.2658
## 1st Qu.:-0.9121 1st Qu.:-0.5681 1st Qu.:-0.8366 1st Qu.:-0.8049
## Median :-0.1441 Median :-0.1084 Median : 0.3171 Median :-0.2790
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.5981 3rd Qu.: 0.4823 3rd Qu.: 0.9059 3rd Qu.: 0.6617
## Max. : 2.7296 Max. : 3.5515 Max. : 1.1164 Max. : 3.9566
## rad tax ptratio black
## Min. :-0.9819 Min. :-1.3127 Min. :-2.7047 Min. :-3.9033
## 1st Qu.:-0.6373 1st Qu.:-0.7668 1st Qu.:-0.4876 1st Qu.: 0.2049
## Median :-0.5225 Median :-0.4642 Median : 0.2746 Median : 0.3808
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 1.6596 3rd Qu.: 1.5294 3rd Qu.: 0.8058 3rd Qu.: 0.4332
## Max. : 1.6596 Max. : 1.7964 Max. : 1.6372 Max. : 0.4406
## lstat medv
## Min. :-1.5296 Min. :-1.9063
## 1st Qu.:-0.7986 1st Qu.:-0.5989
## Median :-0.1811 Median :-0.1449
## Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.6024 3rd Qu.: 0.2683
## Max. : 3.5453 Max. : 2.9865
The means are all zero, as expected. Each variable also has unit variance.
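What scale() does per column can be checked by hand on a toy vector: it subtracts the mean and divides by the (n − 1) standard deviation, the same as sd():

```r
# Manual standardization matches scale()
x <- c(2, 4, 6, 8)
manual <- (x - mean(x)) / sd(x)
all.equal(as.numeric(scale(x)), manual)  # TRUE
```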
Now let’s calculate the crime quartiles (from the scaled crime rate) and use them as break points for a categorical variable.
bins = quantile(boston_sc$crim)
bins
## 0% 25% 50% 75% 100%
## -0.419366929 -0.410563278 -0.390280295 0.007389247 9.924109610
labels = c("low", "med_low", "med_high", "high")
crime <- cut(boston_sc$crim,
breaks=bins,
labels=labels,
include.lowest=TRUE)
boston_sc <- dplyr::select(boston_sc, -crim)
boston_sc <- data.frame(boston_sc, crime)
str(boston_sc)
## 'data.frame': 506 obs. of 14 variables:
## $ zn : num 0.285 -0.487 -0.487 -0.487 -0.487 ...
## $ indus : num -1.287 -0.593 -0.593 -1.306 -1.306 ...
## $ chas : num -0.272 -0.272 -0.272 -0.272 -0.272 ...
## $ nox : num -0.144 -0.74 -0.74 -0.834 -0.834 ...
## $ rm : num 0.413 0.194 1.281 1.015 1.227 ...
## $ age : num -0.12 0.367 -0.266 -0.809 -0.511 ...
## $ dis : num 0.14 0.557 0.557 1.077 1.077 ...
## $ rad : num -0.982 -0.867 -0.867 -0.752 -0.752 ...
## $ tax : num -0.666 -0.986 -0.986 -1.105 -1.105 ...
## $ ptratio: num -1.458 -0.303 -0.303 0.113 0.113 ...
## $ black : num 0.441 0.441 0.396 0.416 0.441 ...
## $ lstat : num -1.074 -0.492 -1.208 -1.36 -1.025 ...
## $ medv : num 0.16 -0.101 1.323 1.182 1.486 ...
## $ crime : Factor w/ 4 levels "low","med_low",..: 1 1 1 1 1 1 2 2 2 2 ...
We have a crime factor, great. Now let’s divide the dataset into train and test sets: 80% goes to training.
ind <- sample(seq(1,nrow(boston_sc)), size=nrow(boston_sc) * 0.8)
train <- boston_sc[ind,]
test <- boston_sc[-ind,]
str(train)
## 'data.frame': 404 obs. of 14 variables:
## $ zn : num -0.487 -0.487 -0.487 -0.487 -0.487 ...
## $ indus : num -1.033 1.015 -0.867 1.015 0.401 ...
## $ chas : num -0.272 -0.272 -0.272 -0.272 3.665 ...
## $ nox : num -0.3857 0.2184 -0.3426 1.1935 -0.0405 ...
## $ rm : num -0.606 -0.51 -0.592 0.265 0.126 ...
## $ age : num 0.00444 0.08615 -0.79133 1.07376 0.8464 ...
## $ dis : num -0.519 -0.421 0.682 -0.983 -0.205 ...
## $ rad : num -0.522 1.66 -0.522 1.66 -0.522 ...
## $ tax : num -0.666 1.529 -1.093 1.529 -0.785 ...
## $ ptratio: num -0.857 0.806 0.806 0.806 -0.949 ...
## $ black : num 0.4 0.132 0.441 0.387 0.406 ...
## $ lstat : num -0.422 0.767 -0.4 0.626 -0.302 ...
## $ medv : num 0.00731 -0.37325 -0.32976 -1.02563 0.0508 ...
## $ crime : Factor w/ 4 levels "low","med_low",..: 2 4 1 4 2 2 4 4 1 4 ...
dim(train)
## [1] 404 14
dim(test)
## [1] 102 14
The train and test sets seem OK.
We use the categorical crime rate as the target variable and all the other variables as predictors.
model = lda(crime ~ ., data=train)
model
## Call:
## lda(crime ~ ., data = train)
##
## Prior probabilities of groups:
## low med_low med_high high
## 0.2425743 0.2599010 0.2524752 0.2450495
##
## Group means:
## zn indus chas nox rm age
## low 0.95045645 -0.9185645 -0.11163110 -0.8838064 0.4650228 -0.8992426
## med_low -0.09624154 -0.2851190 -0.08484810 -0.5758931 -0.1428619 -0.3950983
## med_high -0.37626407 0.1398181 0.22945822 0.2912219 0.1455942 0.4212409
## high -0.48724019 1.0171737 -0.07348562 1.0677569 -0.4474009 0.8112300
## dis rad tax ptratio black lstat
## low 0.8578701 -0.6900668 -0.7021551 -0.45550608 0.39022636 -0.81719143
## med_low 0.3782003 -0.5465475 -0.4639872 -0.05424541 0.31664501 -0.15402757
## med_high -0.3700529 -0.3783630 -0.2981359 -0.21403621 0.09955017 0.00175394
## high -0.8647757 1.6375616 1.5136504 0.78011702 -0.84417874 0.94540275
## medv
## low 0.553618120
## med_low -0.007709118
## med_high 0.196730462
## high -0.723600438
##
## Coefficients of linear discriminants:
## LD1 LD2 LD3
## zn 0.10316704 0.811757515 -1.01987469
## indus 0.01018009 -0.286104045 0.29975540
## chas -0.08638824 -0.108378592 -0.01351102
## nox 0.52324891 -0.662892219 -1.20924277
## rm -0.11065872 -0.092176778 -0.20863967
## age 0.23458930 -0.369875983 -0.21227998
## dis -0.08188494 -0.395031627 0.37841919
## rad 3.01956214 0.849319590 0.01054800
## tax -0.10263142 0.097262617 0.53496963
## ptratio 0.09926266 0.083500642 -0.29884397
## black -0.15270961 -0.007406165 0.07089236
## lstat 0.32461955 -0.322536505 0.31548137
## medv 0.24696096 -0.382313792 -0.16630550
##
## Proportion of trace:
## LD1 LD2 LD3
## 0.9470 0.0391 0.0139
Draw the LDA (bi)plot. Use color from the crime factor index.
classes = as.numeric(train$crime)
plot(model, dimen=2, col=classes, pch=classes)
It seems the highest crime quartile is clearly separable from the other clusters. We can proceed to the model testing part.
# remove the crime factor from the test set
correct_classes = test$crime
test <- dplyr::select(test, -crime)
str(correct_classes)
## Factor w/ 4 levels "low","med_low",..: 1 2 2 3 3 3 3 3 1 2 ...
str(test)
## 'data.frame': 102 obs. of 13 variables:
## $ zn : num -0.4872 0.0487 0.0487 -0.4872 -0.4872 ...
## $ indus : num -1.306 -0.476 -0.476 -0.437 -0.437 ...
## $ chas : num -0.272 -0.272 -0.272 -0.272 -0.272 ...
## $ nox : num -0.834 -0.265 -0.265 -0.144 -0.144 ...
## $ rm : num 1.227 -0.93 -0.392 -0.478 -0.498 ...
## $ age : num -0.511 1.116 0.509 -0.241 -1.395 ...
## $ dis : num 1.077 1.086 1.155 0.433 0.334 ...
## $ rad : num -0.752 -0.522 -0.522 -0.637 -0.637 ...
## $ tax : num -1.105 -0.577 -0.577 -0.601 -0.601 ...
## $ ptratio: num 0.113 -1.504 -1.504 1.175 1.175 ...
## $ black : num 0.441 0.328 0.441 0.441 0.331 ...
## $ lstat : num -1.0255 2.4194 0.0864 -0.6152 -0.8504 ...
## $ medv : num 1.486 -0.6559 -0.395 -0.2319 0.0617 ...
The test set no longer contains the correct classes, as intended.
Prediction of crime categories with LDA using the prepared test data:
predictions = predict(model, newdata=test)
# Cross tabulate the results with the crime categories from the test set.
table(correct=correct_classes, predicted=predictions$class)
## predicted
## correct low med_low med_high high
## low 15 11 3 0
## med_low 3 12 6 0
## med_high 1 5 18 0
## high 0 0 0 28
This produces quite a fair confusion matrix with a strong diagonal. Some confusion between low and med_low is apparent; otherwise the result is quite good.
Let’s reload and standardize the Boston dataset
data("Boston")
# Scale the variables to mean zero and unit variance
boston_sc <- as.data.frame(scale(Boston))
# This is one of my favorite methods in R. (besides the heatmap)
dist_eu <- dist(boston_sc, method="euclidean")
summary(dist_eu)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.1343 3.4625 4.8241 4.9111 6.1863 14.3970
# Running k-means algorithm on the dataset.
# What is the optimal number of clusters? Try 4 ...
km <- kmeans(boston_sc, centers=4)
# Visualize the clusters with a few variables as pairs
pairs(boston_sc[c('tax','dis','age',
'nox','indus')], col=km$cluster)
OK. Some clusters do seem to have been found, since we see a clear connection between the cluster color and the xy-position in the plots.
Let’s run k-means again, using the total within-cluster sum of squares (TWCSS) to find an optimal number of clusters k.
# boston_sc dataset is available
set.seed(123)
# determine the number of clusters
k_max <- 10
# calculate the total within sum of squares
twcss <- sapply(1:k_max, function(k){kmeans(boston_sc, k)$tot.withinss})
# visualize the results
qplot(x = 1:k_max, y = twcss, geom = 'line')
There is a clear drop in TWCSS at 2 clusters, which by the elbow heuristic suggests two clusters.
Let us do the clustering again with centers=2
# k-means clustering
km <-kmeans(boston_sc, centers = 2)
# plot the Boston dataset with clusters
pairs(boston_sc, col=km$cluster)
This looks pretty good. For a clearer picture, let’s look at the same variable pairs as before:
# plot the Boston dataset with clusters
pairs(boston_sc[c('tax','dis','age',
'nox','indus')], col=km$cluster)
We can be pretty happy with this. There are two distinct clusters. Some bleeding between them can still be seen in some pairs, but other dimensions show a very distinct split between the clusters.
# Reload the original Boston dataset
data("Boston")
# Scale the variables again to zero mean and unit variance
boston_sc <- as.data.frame(scale(Boston))
set.seed(123)
# 3 clusters using k-means for targets
km <- kmeans(boston_sc, centers=3)
# LDA using the km clusters
lda.fit <- lda(x=boston_sc, grouping=km$cluster)
# Examine the object
lda.fit
## Call:
## lda(boston_sc, grouping = km$cluster)
##
## Prior probabilities of groups:
## 1 2 3
## 0.2806324 0.3992095 0.3201581
##
## Group means:
## crim zn indus chas nox rm
## 1 0.9693718 -0.4872402 1.074440092 -0.02279455 1.04197430 -0.4146077
## 2 -0.3549295 -0.4039269 0.009294842 0.11748284 0.01531993 -0.2547135
## 3 -0.4071299 0.9307491 -0.953383032 -0.12651054 -0.93243813 0.6810272
## age dis rad tax ptratio black
## 1 0.7666895 -0.8346743 1.5010821 1.4852884 0.73584205 -0.7605477
## 2 0.3096462 -0.2267757 -0.5759279 -0.4964651 -0.09219308 0.2473725
## 3 -1.0581385 1.0143978 -0.5976310 -0.6828704 -0.53004055 0.3582008
## lstat medv
## 1 0.85963373 -0.6874933
## 2 0.09168925 -0.1052456
## 3 -0.86783467 0.7338497
##
## Coefficients of linear discriminants:
## LD1 LD2
## crim 0.03654114 0.20373943
## zn -0.08346821 0.34784463
## indus -0.32262409 -0.12105014
## chas -0.04761479 -0.13327215
## nox -0.13026254 0.15610984
## rm 0.13267423 0.44058946
## age -0.11936644 -0.84880847
## dis 0.23454618 0.58819732
## rad -1.96894437 0.57933028
## tax -1.10861600 0.53984421
## ptratio -0.13087741 -0.02004405
## black 0.15432491 -0.06106305
## lstat -0.14002173 0.14786473
## medv 0.02559139 0.37307811
##
## Proportion of trace:
## LD1 LD2
## 0.8999 0.1001
# now with arrows as in the datacamp task
lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1,
color = "magenta", tex = 0.75, choices = c(1,2))
{
heads <- coef(x)
arrows(x0 = 0, y0 = 0,
x1 = myscale * heads[,choices[1]],
y1 = myscale * heads[,choices[2]], col=color, length = arrow_heads)
text(jitter(myscale * heads[,choices]), labels = row.names(heads),
cex = tex, col="black", pos=3)
}
# Plot with arrows
plot(lda.fit, dimen = 2, col=km$cluster, pch=km$cluster, ylim=c(-7,7),xlim=c(-10,10))
lda.arrows(lda.fit, myscale = 5)
Two of the separable clusters could be explained by: i) high-value homes on large lots (by surface area), and ii) closeness to radial highways.
This clustering seems very appropriate, and we can be happy with this.
Seppo Nyrkkö Fri Nov 27, 2020
library(ggplot2)
library(GGally)
# read human data, row names in first column
human <- read.csv("data/human.csv",row.names=1)
# examine the structure
str(human)
## 'data.frame': 155 obs. of 8 variables:
## $ ratiofm2edu : num 1.007 0.997 0.983 0.989 0.969 ...
## $ ratiofmlabor: num 0.891 0.819 0.825 0.884 0.829 ...
## $ eduexpy : num 17.5 20.2 15.8 18.7 17.9 16.5 18.6 16.5 15.9 19.2 ...
## $ lifeexp : num 81.6 82.4 83 80.2 81.6 80.9 80.9 79.1 82 81.8 ...
## $ gni : int 64992 42261 56431 44025 45435 43919 39568 52947 42155 32689 ...
## $ matmort : int 4 6 6 5 6 7 9 28 11 8 ...
## $ adolbirth : num 7.8 12.1 1.9 5.1 6.2 3.8 8.2 31 14.5 25.3 ...
## $ parlrepr : num 39.6 30.5 28.5 38 36.9 36.9 19.9 19.4 28.2 31.4 ...
dim(human)
## [1] 155 8
summary(human)
## ratiofm2edu ratiofmlabor eduexpy lifeexp
## Min. :0.1717 Min. :0.1857 Min. : 5.40 Min. :49.00
## 1st Qu.:0.7264 1st Qu.:0.5984 1st Qu.:11.25 1st Qu.:66.30
## Median :0.9375 Median :0.7535 Median :13.50 Median :74.20
## Mean :0.8529 Mean :0.7074 Mean :13.18 Mean :71.65
## 3rd Qu.:0.9968 3rd Qu.:0.8535 3rd Qu.:15.20 3rd Qu.:77.25
## Max. :1.4967 Max. :1.0380 Max. :20.20 Max. :83.50
## gni matmort adolbirth parlrepr
## Min. : 581 Min. : 1.0 Min. : 0.60 Min. : 0.00
## 1st Qu.: 4198 1st Qu.: 11.5 1st Qu.: 12.65 1st Qu.:12.40
## Median : 12040 Median : 49.0 Median : 33.60 Median :19.30
## Mean : 17628 Mean : 149.1 Mean : 47.16 Mean :20.91
## 3rd Qu.: 24512 3rd Qu.: 190.0 3rd Qu.: 71.95 3rd Qu.:27.95
## Max. :123124 Max. :1100.0 Max. :204.80 Max. :57.50
The data has the country as the row name, with variables for the female/male ratio in secondary education, the female/male labor-force participation ratio, expected years of education, life expectancy, the nation's GNI per capita, maternal mortality, the adolescent birth rate and the share of female parliament representation.
Let’s see some ggpairs:
ggpairs(human)
There seems to be a positive correlation between GNI, expected years of education, and life expectancy. Maternal mortality and the adolescent birth rate correlate inversely with GNI.
hist(human$gni,breaks = seq(0,1,0.2)*max(human$gni),col='blue')
The highest GNI per capita values occur in only a few countries, so the distribution is strongly right-skewed. Bear this in mind.
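A strongly right-skewed variable like GNI often becomes easier to inspect on a logarithmic scale. A small sketch of the idea on simulated income-like data (applying `log()` to the actual `human$gni` column would behave the same way):

```r
set.seed(1)
# Simulated right-skewed, income-like values
gni_like <- rlnorm(155, meanlog = 9, sdlog = 1.2)

# On the raw scale a few large values dominate the histogram...
hist(gni_like, col = "blue", main = "raw scale")

# ...while on the log scale the distribution is far more symmetric
hist(log(gni_like), col = "blue", main = "log scale")
```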
First, some PCA on the non-standardized data.
pc_human <- prcomp(human)
pc_human
## Standard deviations (1, .., p=8):
## [1] 1.854416e+04 1.855219e+02 2.518701e+01 1.145441e+01 3.766241e+00
## [6] 1.565912e+00 1.912052e-01 1.591112e-01
##
## Rotation (n x k) = (8 x 8):
## PC1 PC2 PC3 PC4
## ratiofm2edu -5.607472e-06 0.0006713951 -3.412027e-05 -2.736326e-04
## ratiofmlabor 2.331945e-07 -0.0002819357 5.302884e-04 -4.692578e-03
## eduexpy -9.562910e-05 0.0075529759 1.427664e-02 -3.313505e-02
## lifeexp -2.815823e-04 0.0283150248 1.294971e-02 -6.752684e-02
## gni -9.999832e-01 -0.0057723054 -5.156742e-04 4.932889e-05
## matmort 5.655734e-03 -0.9916320120 1.260302e-01 -6.100534e-03
## adolbirth 1.233961e-03 -0.1255502723 -9.918113e-01 5.301595e-03
## parlrepr -5.526460e-05 0.0032317269 -7.398331e-03 -9.971232e-01
## PC5 PC6 PC7 PC8
## ratiofm2edu -0.0022935252 2.180183e-02 6.998623e-01 7.139410e-01
## ratiofmlabor 0.0022190154 3.264423e-02 7.132267e-01 -7.001533e-01
## eduexpy 0.1431180282 9.882477e-01 -3.826887e-02 7.776451e-03
## lifeexp 0.9865644425 -1.453515e-01 5.380452e-03 2.281723e-03
## gni -0.0001135863 -2.711698e-05 -8.075191e-07 -1.176762e-06
## matmort 0.0266373214 1.695203e-03 1.355518e-04 8.371934e-04
## adolbirth 0.0188618600 1.273198e-02 -8.641234e-05 -1.707885e-04
## parlrepr -0.0716401914 -2.309896e-02 -2.642548e-03 2.680113e-03
summary(pc_human)
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7 PC8
## Standard deviation 1.854e+04 185.5219 25.19 11.45 3.766 1.566 0.1912 0.1591
## Proportion of Variance 9.999e-01 0.0001 0.00 0.00 0.000 0.000 0.0000 0.0000
## Cumulative Proportion 9.999e-01 1.0000 1.00 1.00 1.000 1.000 1.0000 1.0000
Note the large differences in the standard deviations of the components.
biplot(pc_human)
## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length =
## arrow.len): zero-length arrow is of indeterminate angle and so skipped
## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length =
## arrow.len): zero-length arrow is of indeterminate angle and so skipped
## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length =
## arrow.len): zero-length arrow is of indeterminate angle and so skipped
## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length =
## arrow.len): zero-length arrow is of indeterminate angle and so skipped
## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length =
## arrow.len): zero-length arrow is of indeterminate angle and so skipped
This also gives a rather unclear biplot. The data seems to lie along a single axis: gni. (This is likely because dollars and percentages live on very different scales, and unscaled PCA is dominated by the variable with the largest variance.)
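The scale sensitivity of unstandardized PCA is easy to demonstrate with a toy example: a variable measured in large units grabs virtually all of the first component regardless of the actual structure (the variable names here are made up for illustration):

```r
set.seed(42)
# Two equally "interesting" variables, one in dollars, one as a percentage
dollars <- rnorm(100, mean = 20000, sd = 15000)
percent <- rnorm(100, mean = 50, sd = 20)
toy <- data.frame(dollars, percent)

# Unscaled: PC1 is essentially just the dollar axis
summary(prcomp(toy))

# Scaled: the variance splits roughly evenly between the components
summary(prcomp(scale(toy)))
```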
We'll continue by standardizing the variables and then redo the PCA.
stdpc_human <- prcomp(scale(human))
stdpc_human
## Standard deviations (1, .., p=8):
## [1] 2.0708380 1.1397204 0.8750485 0.7788630 0.6619563 0.5363061 0.4589994
## [8] 0.3222406
##
## Rotation (n x k) = (8 x 8):
## PC1 PC2 PC3 PC4 PC5
## ratiofm2edu -0.35664370 0.03796058 -0.24223089 0.62678110 -0.5983585
## ratiofmlabor 0.05457785 0.72432726 -0.58428770 0.06199424 0.2625067
## eduexpy -0.42766720 0.13940571 -0.07340270 -0.07020294 0.1659678
## lifeexp -0.44372240 -0.02530473 0.10991305 -0.05834819 0.1628935
## gni -0.35048295 0.05060876 -0.20168779 -0.72727675 -0.4950306
## matmort 0.43697098 0.14508727 -0.12522539 -0.25170614 -0.1800657
## adolbirth 0.41126010 0.07708468 0.01968243 0.04986763 -0.4672068
## parlrepr -0.08438558 0.65136866 0.72506309 0.01396293 -0.1523699
## PC6 PC7 PC8
## ratiofm2edu 0.17713316 0.05773644 0.16459453
## ratiofmlabor -0.03500707 -0.22729927 -0.07304568
## eduexpy -0.38606919 0.77962966 -0.05415984
## lifeexp -0.42242796 -0.43406432 0.62737008
## gni 0.11120305 -0.13711838 -0.16961173
## matmort 0.17370039 0.35380306 0.72193946
## adolbirth -0.76056557 -0.06897064 -0.14335186
## parlrepr 0.13749772 0.00568387 -0.02306476
Exploring some dimensions:
barplot(stdpc_human$rotation[,1])
title('PC1: high maternal mortality and adolescent birth rate')
barplot(stdpc_human$rotation[,2])
title('PC2: high female labour ratio and parliament representation')
summary(stdpc_human)
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7
## Standard deviation 2.0708 1.1397 0.87505 0.77886 0.66196 0.53631 0.45900
## Proportion of Variance 0.5361 0.1624 0.09571 0.07583 0.05477 0.03595 0.02634
## Cumulative Proportion 0.5361 0.6984 0.79413 0.86996 0.92473 0.96069 0.98702
## PC8
## Standard deviation 0.32224
## Proportion of Variance 0.01298
## Cumulative Proportion 1.00000
biplot(stdpc_human)
Now the component arrows are distinguishable, and the biplot is clearer to read as a “map”.
The PC1 component captures the dimension where GNI is low while maternal mortality and the adolescent birth rate are high. The PC2 component captures a dimension where female labor-force participation is common and parliament representation is also higher.
# some libraries
library(FactoMineR)
library(dplyr)
library(tidyr)
Let’s use the tea dataset, a survey of tea consumption: 300 individuals answered questions on how they drink tea (18 questions), their perceptions of the product (12 questions), and some personal details (4 questions).
data(tea)
str(tea)
## 'data.frame': 300 obs. of 36 variables:
## $ breakfast : Factor w/ 2 levels "breakfast","Not.breakfast": 1 1 2 2 1 2 1 2 1 1 ...
## $ tea.time : Factor w/ 2 levels "Not.tea time",..: 1 1 2 1 1 1 2 2 2 1 ...
## $ evening : Factor w/ 2 levels "evening","Not.evening": 2 2 1 2 1 2 2 1 2 1 ...
## $ lunch : Factor w/ 2 levels "lunch","Not.lunch": 2 2 2 2 2 2 2 2 2 2 ...
## $ dinner : Factor w/ 2 levels "dinner","Not.dinner": 2 2 1 1 2 1 2 2 2 2 ...
## $ always : Factor w/ 2 levels "always","Not.always": 2 2 2 2 1 2 2 2 2 2 ...
## $ home : Factor w/ 2 levels "home","Not.home": 1 1 1 1 1 1 1 1 1 1 ...
## $ work : Factor w/ 2 levels "Not.work","work": 1 1 2 1 1 1 1 1 1 1 ...
## $ tearoom : Factor w/ 2 levels "Not.tearoom",..: 1 1 1 1 1 1 1 1 1 2 ...
## $ friends : Factor w/ 2 levels "friends","Not.friends": 2 2 1 2 2 2 1 2 2 2 ...
## $ resto : Factor w/ 2 levels "Not.resto","resto": 1 1 2 1 1 1 1 1 1 1 ...
## $ pub : Factor w/ 2 levels "Not.pub","pub": 1 1 1 1 1 1 1 1 1 1 ...
## $ Tea : Factor w/ 3 levels "black","Earl Grey",..: 1 1 2 2 2 2 2 1 2 1 ...
## $ How : Factor w/ 4 levels "alone","lemon",..: 1 3 1 1 1 1 1 3 3 1 ...
## $ sugar : Factor w/ 2 levels "No.sugar","sugar": 2 1 1 2 1 1 1 1 1 1 ...
## $ how : Factor w/ 3 levels "tea bag","tea bag+unpackaged",..: 1 1 1 1 1 1 1 1 2 2 ...
## $ where : Factor w/ 3 levels "chain store",..: 1 1 1 1 1 1 1 1 2 2 ...
## $ price : Factor w/ 6 levels "p_branded","p_cheap",..: 4 6 6 6 6 3 6 6 5 5 ...
## $ age : int 39 45 47 23 48 21 37 36 40 37 ...
## $ sex : Factor w/ 2 levels "F","M": 2 1 1 2 2 2 2 1 2 2 ...
## $ SPC : Factor w/ 7 levels "employee","middle",..: 2 2 4 6 1 6 5 2 5 5 ...
## $ Sport : Factor w/ 2 levels "Not.sportsman",..: 2 2 2 1 2 2 2 2 2 1 ...
## $ age_Q : Factor w/ 5 levels "15-24","25-34",..: 3 4 4 1 4 1 3 3 3 3 ...
## $ frequency : Factor w/ 4 levels "1/day","1 to 2/week",..: 1 1 3 1 3 1 4 2 3 3 ...
## $ escape.exoticism: Factor w/ 2 levels "escape-exoticism",..: 2 1 2 1 1 2 2 2 2 2 ...
## $ spirituality : Factor w/ 2 levels "Not.spirituality",..: 1 1 1 2 2 1 1 1 1 1 ...
## $ healthy : Factor w/ 2 levels "healthy","Not.healthy": 1 1 1 1 2 1 1 1 2 1 ...
## $ diuretic : Factor w/ 2 levels "diuretic","Not.diuretic": 2 1 1 2 1 2 2 2 2 1 ...
## $ friendliness : Factor w/ 2 levels "friendliness",..: 2 2 1 2 1 2 2 1 2 1 ...
## $ iron.absorption : Factor w/ 2 levels "iron absorption",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ feminine : Factor w/ 2 levels "feminine","Not.feminine": 2 2 2 2 2 2 2 1 2 2 ...
## $ sophisticated : Factor w/ 2 levels "Not.sophisticated",..: 1 1 1 2 1 1 1 2 2 1 ...
## $ slimming : Factor w/ 2 levels "No.slimming",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ exciting : Factor w/ 2 levels "exciting","No.exciting": 2 1 2 2 2 2 2 2 2 2 ...
## $ relaxing : Factor w/ 2 levels "No.relaxing",..: 1 1 2 2 2 2 2 2 2 2 ...
## $ effect.on.health: Factor w/ 2 levels "effect on health",..: 2 2 2 2 2 2 2 2 2 2 ...
# reduced dataset
tea_segments <- dplyr::select(tea, c("Tea", "relaxing", "effect.on.health", "sugar", "age_Q", "sex"))
str(tea_segments)
## 'data.frame': 300 obs. of 6 variables:
## $ Tea : Factor w/ 3 levels "black","Earl Grey",..: 1 1 2 2 2 2 2 1 2 1 ...
## $ relaxing : Factor w/ 2 levels "No.relaxing",..: 1 1 2 2 2 2 2 2 2 2 ...
## $ effect.on.health: Factor w/ 2 levels "effect on health",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ sugar : Factor w/ 2 levels "No.sugar","sugar": 2 1 1 2 1 1 1 1 1 1 ...
## $ age_Q : Factor w/ 5 levels "15-24","25-34",..: 3 4 4 1 4 1 3 3 3 3 ...
## $ sex : Factor w/ 2 levels "F","M": 2 1 1 2 2 2 2 1 2 2 ...
dim(tea_segments)
## [1] 300 6
summary(tea_segments)
## Tea relaxing effect.on.health sugar
## black : 74 No.relaxing:113 effect on health : 66 No.sugar:155
## Earl Grey:193 relaxing :187 No.effect on health:234 sugar :145
## green : 33
##
##
## age_Q sex
## 15-24:92 F:178
## 25-34:69 M:122
## 35-44:40
## 45-59:61
## +60 :38
Now let’s do Multiple Correspondence Analysis on the reduced tea data.
# some visualization
gather(tea_segments) %>% ggplot(aes(value)) + facet_wrap("key", scales = "free") + geom_bar() + theme(axis.text.x = element_text(angle = 30, hjust = 1, size = 10))
## Warning: attributes are not identical across measure variables;
## they will be dropped
We can see that all categories of our factors are populated well enough to support further analysis of the reduced data set.
Now continuing to the MCA (which is a kind of categorical PCA):
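Under the hood, MCA operates on the indicator (one-hot) expansion of the factor columns, which is what makes it “PCA-like” for categorical data. A minimal sketch of that expansion with base R (the tiny data frame here is made up for illustration):

```r
# A toy categorical data frame
d <- data.frame(Tea = factor(c("black", "green", "black")),
                sugar = factor(c("sugar", "No.sugar", "sugar")))

# Indicator matrix: one 0/1 column per factor level;
# each row sums to the number of variables (one active level per variable)
ind <- cbind(model.matrix(~ 0 + Tea, d), model.matrix(~ 0 + sugar, d))
ind
```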
mca <- MCA(tea_segments, graph = FALSE)
# summary of the model
summary(mca)
##
## Call:
## MCA(X = tea_segments, graph = FALSE)
##
##
## Eigenvalues
## Dim.1 Dim.2 Dim.3 Dim.4 Dim.5 Dim.6 Dim.7
## Variance 0.271 0.228 0.207 0.182 0.172 0.151 0.134
## % of var. 16.232 13.667 12.445 10.919 10.329 9.038 8.038
## Cumulative % of var. 16.232 29.899 42.344 53.263 63.592 72.630 80.667
## Dim.8 Dim.9 Dim.10
## Variance 0.128 0.100 0.094
## % of var. 7.658 6.015 5.659
## Cumulative % of var. 88.326 94.341 100.000
##
## Individuals (the 10 first)
## Dim.1 ctr cos2 Dim.2 ctr cos2 Dim.3
## 1 | 0.266 0.087 0.030 | 0.441 0.285 0.083 | 0.533
## 2 | 1.002 1.237 0.572 | -0.333 0.163 0.063 | -0.224
## 3 | 0.311 0.119 0.083 | -0.564 0.465 0.273 | -0.331
## 4 | -0.817 0.822 0.643 | 0.023 0.001 0.001 | 0.122
## 5 | 0.237 0.069 0.044 | -0.018 0.000 0.000 | -0.227
## 6 | -0.380 0.178 0.142 | -0.187 0.051 0.034 | 0.098
## 7 | 0.012 0.000 0.000 | 0.001 0.000 0.000 | 0.402
## 8 | 0.494 0.300 0.121 | -0.436 0.278 0.094 | 0.830
## 9 | 0.012 0.000 0.000 | 0.001 0.000 0.000 | 0.402
## 10 | 0.420 0.217 0.082 | 0.110 0.018 0.006 | 0.935
## ctr cos2
## 1 0.457 0.122 |
## 2 0.081 0.029 |
## 3 0.176 0.094 |
## 4 0.024 0.014 |
## 5 0.083 0.040 |
## 6 0.016 0.010 |
## 7 0.259 0.094 |
## 8 1.107 0.343 |
## 9 0.259 0.094 |
## 10 1.404 0.408 |
##
## Categories (the 10 first)
## Dim.1 ctr cos2 v.test Dim.2 ctr cos2
## black | 0.828 10.412 0.224 8.190 | 0.125 0.281 0.005
## Earl Grey | -0.444 7.824 0.356 -10.318 | -0.187 1.647 0.063
## green | 0.742 3.735 0.068 4.513 | 0.814 5.336 0.082
## No.relaxing | 0.551 7.039 0.183 7.403 | 0.217 1.294 0.028
## relaxing | -0.333 4.253 0.183 -7.403 | -0.131 0.782 0.028
## effect on health | 0.190 0.491 0.010 1.748 | 0.417 2.799 0.049
## No.effect on health | -0.054 0.139 0.010 -1.748 | -0.118 0.789 0.049
## No.sugar | 0.659 13.823 0.464 11.782 | -0.291 3.198 0.090
## sugar | -0.704 14.776 0.464 -11.782 | 0.311 3.419 0.090
## 15-24 | -0.878 14.554 0.341 -10.094 | -0.735 12.115 0.239
## v.test Dim.3 ctr cos2 v.test
## black 1.235 | 1.127 25.157 0.416 11.147 |
## Earl Grey -4.344 | -0.330 5.628 0.196 -7.663 |
## green 4.950 | -0.597 3.145 0.044 -3.626 |
## No.relaxing 2.913 | -0.725 15.921 0.318 -9.749 |
## relaxing -2.913 | 0.438 9.621 0.318 9.749 |
## effect on health 3.829 | -0.110 0.212 0.003 -1.006 |
## No.effect on health -3.829 | 0.031 0.060 0.003 1.006 |
## No.sugar -5.200 | -0.032 0.042 0.001 -0.571 |
## sugar 5.200 | 0.034 0.045 0.001 0.571 |
## 15-24 -8.450 | -0.008 0.002 0.000 -0.096 |
##
## Categorical variables (eta2)
## Dim.1 Dim.2 Dim.3
## Tea | 0.357 0.099 0.422 |
## relaxing | 0.183 0.028 0.318 |
## effect.on.health | 0.010 0.049 0.003 |
## sugar | 0.464 0.090 0.001 |
## age_Q | 0.596 0.511 0.480 |
## sex | 0.013 0.589 0.020 |
# visualize MCA
plot(mca, invisible=c("ind"), habillage = "quali")
This gives a rather nice plot for understanding the association between the young age group (15-24) and Earl Grey tea, the most commonly available tea blend.
Black tea is a common choice for all age groups over 35 years.
Sugar consumption is more common in the younger age groups, as the taste buds are not yet ‘driven in’, so to say.
When designing a set of tea flavors, this graph could give hints for creating product packages and images that appeal to the targeted market segments.
Seppo Nyrkkö Fri Dec 4, 2020
library(ggplot2)
library(dplyr)
library(tidyr)
# Rat diet weighing data in long format
longrats <- read.csv("data/ratslong.csv")
# BPRS patient treatment scores
longbprs <- read.csv("data/bprslong.csv")
summary(longrats)
## ID Group days weight
## rat-1 : 11 Grp1:88 Min. : 1.00 Min. :225.0
## rat-10 : 11 Grp2:44 1st Qu.:15.00 1st Qu.:267.0
## rat-11 : 11 Grp3:44 Median :36.00 Median :344.5
## rat-12 : 11 Mean :33.55 Mean :384.5
## rat-13 : 11 3rd Qu.:50.00 3rd Qu.:511.2
## rat-14 : 11 Max. :64.00 Max. :628.0
## (Other):110
str(longrats)
## 'data.frame': 176 obs. of 4 variables:
## $ ID : Factor w/ 16 levels "rat-1","rat-10",..: 1 9 10 11 12 13 14 15 16 2 ...
## $ Group : Factor w/ 3 levels "Grp1","Grp2",..: 1 1 1 1 1 1 1 1 2 2 ...
## $ days : int 1 1 1 1 1 1 1 1 1 1 ...
## $ weight: int 240 225 245 260 255 260 275 245 410 405 ...
# Some ggplots
# Display the data graphically, first by groups.
ggplot(longrats, aes(x = days, y = weight, linetype = ID)) +
geom_line() +
scale_linetype_manual(values = rep(1:10, times=4)) +
facet_grid(. ~ Group, labeller = label_both) +
theme(legend.position = "none") +
scale_y_continuous(limits = c(min(longrats$weight), max(longrats$weight)))
Seems like the grouping is handled correctly. Time runs in the right direction and weight grows as the days pass.
Group 1 doesn’t gain weight as much as groups 2 and 3. An outlier rat in group 2 is a bit heavier but still belongs to the correct diet group.
Since the initial weights differ so much, we can understand the effects of the diets better by standardizing the data.
# Standardizing = zero mean, unit standard deviation (within each measurement day)
longrats <- longrats %>%
group_by(days) %>%
mutate(stdWeight = (weight - mean(weight)) / sd(weight)) %>%
ungroup()
str(longrats)
## Classes 'tbl_df', 'tbl' and 'data.frame': 176 obs. of 5 variables:
## $ ID : Factor w/ 16 levels "rat-1","rat-10",..: 1 9 10 11 12 13 14 15 16 2 ...
## $ Group : Factor w/ 3 levels "Grp1","Grp2",..: 1 1 1 1 1 1 1 1 2 2 ...
## $ days : int 1 1 1 1 1 1 1 1 1 1 ...
## $ weight : int 240 225 245 260 255 260 275 245 410 405 ...
## $ stdWeight: num -1.001 -1.12 -0.961 -0.842 -0.882 ...
This adjusts each weight relative to the overall growth on that day. Again, graphically:
ggplot(longrats, aes(x = days, y = stdWeight, linetype = ID)) +
geom_line() +
scale_linetype_manual(values = rep(1:10, times=4)) +
facet_grid(. ~ Group, labeller = label_both) +
scale_y_continuous(name = "std weight")
This looks much more promising. The control group (1) stays near the center, while the two diet groups (2 and 3) seem to push the weight in some direction as time passes.
Now let’s calculate the group means and standard errors.
# some fancy pipelining
ratgroups <- longrats %>%
group_by(Group, days) %>%
summarise(mean = mean(weight), se = sd(weight) / sqrt(length(weight))) %>%
ungroup()
summary(ratgroups)
## Group days mean se
## Grp1:11 Min. : 1.00 Min. :250.6 Min. : 3.604
## Grp2:11 1st Qu.:15.00 1st Qu.:269.5 1st Qu.: 4.809
## Grp3:11 Median :36.00 Median :486.5 Median :12.202
## Mean :33.55 Mean :424.7 Mean :17.149
## 3rd Qu.:50.00 3rd Qu.:518.2 3rd Qu.:34.903
## Max. :64.00 Max. :550.2 Max. :37.129
dim(ratgroups)
## [1] 33 4
Now we get a graphical view of how the groups respond to the diets over the days of the study:
ggplot(ratgroups,
       aes(x = days, y = mean, linetype = Group, shape = Group)) +
  geom_line() +
  geom_point(size = 3) +
  # linetype is a constant, so it goes outside aes()
  geom_errorbar(aes(ymin = mean - se, ymax = mean + se), linetype = 2, width = 1) +
  # legend.position takes an (x, y) pair of length 2
  theme(legend.position = c(0.8, 0.8)) +
  scale_y_continuous(name = "mean(weight) +/- std err")
To validate our rats, we can draw boxplots of the per-rat mean weights to visually determine whether the groups are okay or whether they contain outlier rats.
ratbox <- longrats %>%
filter(days > 1) %>%
group_by(Group, ID) %>%
summarise(mean=mean(weight) ) %>%
ungroup()
ggplot(ratbox, aes(x = Group, y = mean)) +
geom_boxplot() +
stat_summary(fun = "mean", geom = "point", shape=1, size=8, fill = "white", col="red") +
scale_y_continuous(name = "mean(weight) days from 1 upwards")
summary(ratbox$Group)
## Grp1 Grp2 Grp3
## 8 4 4
Since there are only 4 rats in groups 2 and 3, it’s really hard to say whether there are any outliers. We keep the one suspiciously heavy rat in group 2.
Let’s continue to the hypothesis testing then. The first question is whether the mean weight of group 1 differs from others.
# Two-sample t-test
ratbox$isGrp1 <- ratbox$Group=="Grp1"
t.test(mean ~ isGrp1, data=ratbox, var.equal=T)
##
## Two Sample t-test
##
## data: mean by isGrp1
## t = 12.628, df = 14, p-value = 4.848e-09
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 201.416 283.834
## sample estimates:
## mean in group FALSE mean in group TRUE
## 507.650 265.025
Note that the confidence interval here concerns the difference between the groups: with 95% confidence, the rats outside group 1 weigh on average about 201 to 284 grams more than the rats in group 1.
Now let’s add the initial weight from the first day, before the diets started.
# Since we are interested in the growth profile of these rats, we take
# the day 1 weight into account
ratinit <- longrats %>%
filter(days == 1) %>%
group_by(Group, ID) %>%
summarise(initialW=mean(weight) ) %>%
ungroup()
ratbox$initialW=ratinit$initialW
fit <- lm(mean ~ initialW + isGrp1, data=ratbox)
fit
##
## Call:
## lm(formula = mean ~ initialW + isGrp1, data = ratbox)
##
## Coefficients:
## (Intercept) initialW isGrp1TRUE
## 86.4844 0.8751 -40.7937
anova(fit)
## Analysis of Variance Table
##
## Response: mean
## Df Sum Sq Mean Sq F value Pr(>F)
## initialW 1 253625 253625 1806.5913 2.448e-15 ***
## isGrp1 1 690 690 4.9159 0.04505 *
## Residuals 13 1825 140
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
After adjusting for the initial weight, the group 1 effect is significant at the 5% level (p ≈ 0.045), suggesting a different growth profile. The visual examination supports this interpretation.
str(longbprs)
## 'data.frame': 360 obs. of 4 variables:
## $ treatment: Factor w/ 2 levels "T1","T2": 1 1 1 1 1 1 1 1 1 1 ...
## $ subject : Factor w/ 40 levels "G1P1","G1P10",..: 1 12 14 15 16 17 18 19 20 2 ...
## $ weeks : int 0 0 0 0 0 0 0 0 0 0 ...
## $ bprs : int 42 58 54 55 72 48 71 30 41 57 ...
summary(longbprs)
## treatment subject weeks bprs
## T1:180 G1P1 : 9 Min. :0 Min. :18.00
## T2:180 G1P10 : 9 1st Qu.:2 1st Qu.:27.00
## G1P11 : 9 Median :4 Median :35.00
## G1P12 : 9 Mean :4 Mean :37.66
## G1P13 : 9 3rd Qu.:6 3rd Qu.:43.00
## G1P14 : 9 Max. :8 Max. :95.00
## (Other):306
glimpse(longbprs)
## Observations: 360
## Variables: 4
## $ treatment <fct> T1, T1, T1, T1, T1, T1, T1, T1, T1, T1, T1, T1, T1, T1, T1,…
## $ subject <fct> G1P1, G1P2, G1P3, G1P4, G1P5, G1P6, G1P7, G1P8, G1P9, G1P10…
## $ weeks <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ bprs <int> 42, 58, 54, 55, 72, 48, 71, 30, 41, 57, 30, 55, 36, 38, 66,…
Let’s visually analyze the BPRS study groups:
ggplot(longbprs, aes(x=weeks, y=bprs, linetype=subject)) +
geom_line() +
scale_linetype_manual(values=rep(1:10, times=4)) +
facet_grid(. ~ treatment, labeller=label_both) +
scale_y_continuous(name="BPRS", limits=c(0,100)) +
theme(legend.position="top")
Okay, this looks good: the data was read in correctly and matches the original BPRS survey data.
We cannot be sure from the plot alone whether the scores go down in the treatment T1 group, but let’s continue with the analysis. The variance in the T2 scores makes it difficult to visually compare the trend over the treatment weeks.
Let’s try some regression modeling.
bprsreg <- lm(bprs ~ weeks + treatment, data=longbprs)
summary(bprsreg)
##
## Call:
## lm(formula = bprs ~ weeks + treatment, data = longbprs)
##
## Residuals:
## Min 1Q Median 3Q Max
## -22.454 -8.965 -3.196 7.002 50.244
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 46.4539 1.3670 33.982 <2e-16 ***
## weeks -2.2704 0.2524 -8.995 <2e-16 ***
## treatmentT2 0.5722 1.3034 0.439 0.661
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 12.37 on 357 degrees of freedom
## Multiple R-squared: 0.1851, Adjusted R-squared: 0.1806
## F-statistic: 40.55 on 2 and 357 DF, p-value: < 2.2e-16
This doesn’t yet show any significant treatment effect on the group means, but the number of weeks elapsed is a significant variable.
Since the observations are repeated measures from the same individuals, we would like to introduce some mixed models. We can start by letting the unique subject id enter as a random effect:
library(lme4)
## Loading required package: Matrix
##
## Attaching package: 'Matrix'
## The following objects are masked from 'package:tidyr':
##
## expand, pack, unpack
# Random intercept model per subject:
bprs_sub <- lmer(bprs ~ weeks + treatment + (1 | subject), data=longbprs, REML = F)
summary(bprs_sub)
## Linear mixed model fit by maximum likelihood ['lmerMod']
## Formula: bprs ~ weeks + treatment + (1 | subject)
## Data: longbprs
##
## AIC BIC logLik deviance df.resid
## 2582.9 2602.3 -1286.5 2572.9 355
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -2.27506 -0.59909 -0.06104 0.44226 3.15835
##
## Random effects:
## Groups Name Variance Std.Dev.
## subject (Intercept) 97.39 9.869
## Residual 54.23 7.364
## Number of obs: 360, groups: subject, 40
##
## Fixed effects:
## Estimate Std. Error t value
## (Intercept) 46.4539 2.3521 19.750
## weeks -2.2704 0.1503 -15.104
## treatmentT2 0.5722 3.2159 0.178
##
## Correlation of Fixed Effects:
## (Intr) weeks
## weeks -0.256
## treatmentT2 -0.684 0.000
# Now a random intercept AND a random slope per subject
bprs_subweek <- lmer(bprs ~ weeks + treatment + (weeks | subject), data=longbprs, REML = F)
summary(bprs_subweek)
## Linear mixed model fit by maximum likelihood ['lmerMod']
## Formula: bprs ~ weeks + treatment + (weeks | subject)
## Data: longbprs
##
## AIC BIC logLik deviance df.resid
## 2523.2 2550.4 -1254.6 2509.2 353
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -2.4655 -0.5150 -0.0920 0.4347 3.7353
##
## Random effects:
## Groups Name Variance Std.Dev. Corr
## subject (Intercept) 167.827 12.955
## weeks 2.331 1.527 -0.67
## Residual 36.747 6.062
## Number of obs: 360, groups: subject, 40
##
## Fixed effects:
## Estimate Std. Error t value
## (Intercept) 45.9830 2.6470 17.372
## weeks -2.2704 0.2713 -8.370
## treatmentT2 1.5139 3.1392 0.482
##
## Correlation of Fixed Effects:
## (Intr) weeks
## weeks -0.545
## treatmentT2 -0.593 0.000
Now let’s run an ANOVA (likelihood ratio) test comparing these two random effects models.
anova(bprs_sub, bprs_subweek)
## Data: longbprs
## Models:
## bprs_sub: bprs ~ weeks + treatment + (1 | subject)
## bprs_subweek: bprs ~ weeks + treatment + (weeks | subject)
## npar AIC BIC logLik deviance Chisq Df Pr(>Chisq)
## bprs_sub 5 2582.9 2602.3 -1286.5 2572.9
## bprs_subweek 7 2523.2 2550.4 -1254.6 2509.2 63.663 2 1.499e-14 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
We can conclude that the model treating both the subject id (intercept) and the personal slope as random effects fits the data significantly better (p < 0.001).
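The p-value of this likelihood ratio test can be checked by hand from the reported deviances: the test statistic is the drop in deviance, compared against a chi-square distribution with the difference in parameter counts (here 2) as degrees of freedom.

```r
# Deviance of the random-intercept model minus that of the
# random-intercept-and-slope model (values as reported above)
chisq <- 2572.9 - 2509.2   # = 63.7

# Two extra parameters: the slope variance and its correlation
# with the intercept
p <- pchisq(chisq, df = 2, lower.tail = FALSE)
p   # very small, in agreement with the anova() output
```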
Let’s plot the fitted curves to examine this visually.
# Compute the fitted values (fitted() needs only the model object)
Fitted <- fitted(bprs_subweek)
# Replace the observed bprs values with the fitted ones
bprslongfit <- mutate(longbprs, bprs = Fitted)
# Draw a plot of bprs (fitted and observed)
ggplot(bprslongfit, aes(x=weeks, y=bprs, group=subject)) +
geom_line(color="blue") +
geom_line(data=longbprs,color="red") +
scale_x_continuous(name="weeks") +
facet_grid(. ~ treatment, labeller=label_both) +
scale_y_continuous(name="BPRS fitted (blue) vs observed (red)") +
theme(legend.position="top")
This comparison helps in understanding the random effects rather well: the blue fitted lines follow each subject’s individual intercept and slope.